Analyzing Income Level in the Adult Dataset#
Author: Sihong Yao
Course Project, UC Irvine, Math 10, Summer 2023
Introduction#
The provided code is part of a project that focuses on analyzing a dataset related to income levels. The project involves data preprocessing, exploratory data analysis through visualizations, and applying machine learning algorithms such as decision trees, K-nearest neighbors, and logistic regression to predict income levels based on various features.
Data preprocessing#
read in dataset
import pandas as pd
import altair as alt
import numpy as np
# train data
with open('adult.data', 'r') as f:
    lines = f.readlines()
columns = 'age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country, label'.split(', ')
train = [line.strip().split(', ') for line in lines]
train = pd.DataFrame(train, columns=columns)
# test data
with open('adult.test', 'r') as f:
    lines = f.readlines()
# the first line of adult.test is a header comment, so skip it
test = [line.strip().split(', ') for line in lines[1:]]
test = pd.DataFrame(test, columns=columns)
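The manual line-splitting above can also be done directly with `pandas.read_csv`, which handles type inference and missing-value markers in one step. A minimal sketch, using an inline two-row stand-in for the file (in practice the file path would be passed instead):

```python
import io
import pandas as pd

columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'sex',
           'capital-gain', 'capital-loss', 'hours-per-week',
           'native-country', 'label']

# Two illustrative rows in the same format as adult.data
sample = ("39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
          "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
          "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
          "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n")

# sep=', ' absorbs the space after each comma; na_values='?' maps the
# dataset's missing-value token to NaN; numeric columns are inferred as int
df = pd.read_csv(io.StringIO(sample), sep=', ', engine='python',
                 names=columns, na_values='?')
```

This also makes the later dtype conversions unnecessary, since `read_csv` infers integer columns automatically.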
overall look
train.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
test.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 1 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 2 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| 4 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32562 entries, 0 to 32561
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             32562 non-null  object
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  object
 3   education       32561 non-null  object
 4   education-num   32561 non-null  object
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  object
 11  capital-loss    32561 non-null  object
 12  hours-per-week  32561 non-null  object
 13  native-country  32561 non-null  object
 14  label           32561 non-null  object
dtypes: object(15)
memory usage: 3.7+ MB
missing values
train.isnull().mean()[train.isnull().mean() > 0]
workclass 0.000031
fnlwgt 0.000031
education 0.000031
education-num 0.000031
marital-status 0.000031
occupation 0.000031
relationship 0.000031
race 0.000031
sex 0.000031
capital-gain 0.000031
capital-loss 0.000031
hours-per-week 0.000031
native-country 0.000031
label 0.000031
dtype: float64
test.isnull().mean()[test.isnull().mean() > 0]
workclass 0.000031
fnlwgt 0.000031
education 0.000031
education-num 0.000031
marital-status 0.000031
occupation 0.000031
relationship 0.000031
race 0.000031
sex 0.000031
capital-gain 0.000031
capital-loss 0.000031
hours-per-week 0.000031
native-country 0.000031
label 0.000031
dtype: float64
# drop those rows with missing values
train = train.dropna()
test = test.dropna()
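Note that `dropna` only removes true NaN entries. In the Adult dataset, missing values are recorded as the string '?', which `dropna` does not catch; a sketch of the extra cleaning step, shown on a small stand-in frame:

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the real train/test frames would be cleaned the same way
df = pd.DataFrame({'workclass': ['Private', '?', 'State-gov'],
                   'occupation': ['Sales', '?', 'Adm-clerical']})

# '?' is this dataset's missing-value token; turn it into NaN so dropna sees it
df = df.replace('?', np.nan).dropna()
```

Without this step, rows with '?' entries survive the `dropna` calls above and flow into the models as ordinary category values.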
change data type
train['age'] = train['age'].astype(int)
train['fnlwgt'] = train['fnlwgt'].astype(int)
train['education-num'] = train['education-num'].astype(int)
train['capital-gain'] = train['capital-gain'].astype(int)
train['capital-loss'] = train['capital-loss'].astype(int)
train['hours-per-week'] = train['hours-per-week'].astype(int)
test['age'] = test['age'].astype(int)
test['fnlwgt'] = test['fnlwgt'].astype(int)
test['education-num'] = test['education-num'].astype(int)
test['capital-gain'] = test['capital-gain'].astype(int)
test['capital-loss'] = test['capital-loss'].astype(int)
test['hours-per-week'] = test['hours-per-week'].astype(int)
# labels in adult.test carry a trailing period ('<=50K.'), so strip it
test['label'] = test['label'].str.replace('.', '', regex=False)
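The repeated `astype` calls above can be condensed into a single conversion over a list of column names. A sketch on an illustrative frame of string columns, as produced by the line-splitting step:

```python
import pandas as pd

# Illustrative frame with string-typed columns
df = pd.DataFrame({'age': ['39', '50'],
                   'fnlwgt': ['77516', '83311'],
                   'hours-per-week': ['40', '13']})

# Convert all numeric columns in one pass instead of one astype call per column
numeric_cols = ['age', 'fnlwgt', 'hours-per-week']
df[numeric_cols] = df[numeric_cols].astype(int)
```

With the real data, `numeric_cols` would list all six numeric columns, and the same line would be applied to both `train` and `test`.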
merge train and test together
df = pd.concat([train, test], axis=0)
df.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
Data visualization#
Bar Plot: A bar plot can be used to visualize the distribution of the target variable and other categorical variables. We can use it to compare the frequency of different categories.
# Sample a subset of the dataset
sample_df = df.sample(n=5000, random_state=42)
# Bar plot for target variable
bar_plot = alt.Chart(sample_df).mark_bar().encode(
    x='label:N',
    y='count():Q'
).properties(
    title='Distribution of Target Variable'
)
bar_plot
df['label'].value_counts()
<=50K 49439
>50K 15682
Name: label, dtype: int64
There are two unique values in the 'label' column: "<=50K" and ">50K". The "<=50K" class appears 49,439 times and the ">50K" class 15,682 times (the currency unit is not specified in the dataset), so roughly three quarters of the individuals fall in the "<=50K" class. The classes are therefore noticeably imbalanced, which is worth keeping in mind when judging model accuracy later.
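The class shares can be read off directly with `value_counts(normalize=True)`; a sketch using the counts from the output above:

```python
import pandas as pd

# Counts taken from the value_counts() output above
labels = pd.Series(['<=50K'] * 49439 + ['>50K'] * 15682)

# normalize=True returns class shares instead of raw counts
shares = labels.value_counts(normalize=True)
```

A classifier that always predicts "<=50K" would already reach about 76% accuracy, which is the baseline the models below should be compared against.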
Histogram: A histogram can help explore the distributions of continuous variables such as age, education-num, capital-gain, capital-loss, and hours-per-week.
# Histogram for age
histogram = alt.Chart(sample_df).mark_bar().encode(
    alt.X('age:Q', bin=True),
    y='count():Q'
).properties(
    title='Distribution of Age'
)
histogram
The histogram shows the distribution of ages in the dataset. The most populated bin covers roughly ages 32 to 39, and frequencies remain high across the neighboring bins from roughly 24 to 46, so the sample is concentrated in the prime working ages.
Box Plot: A box plot can be used to visualize the distribution of continuous variables across different categories. For example, we can compare the distribution of age between different workclasses.
# Box plot for age across workclass
box_plot = alt.Chart(sample_df).mark_boxplot().encode(
    x='workclass:N',
    y='age:Q'
).properties(
    title='Distribution of Age across Workclass'
)
box_plot
Scatter Plot: A scatter plot can help visualize the relationship between two continuous variables. For instance, we can examine the relationship between age and capital-gain.
# Scatter plot for age and capital-gain
scatter_plot = alt.Chart(sample_df).mark_circle().encode(
    x='age:Q',
    y='capital-gain:Q',
    color='label:N'
).properties(
    title='Relationship between Age and Capital Gain'
)
scatter_plot
The scatter plot suggests some relationship among age, capital gain, and the label: the points with large capital gains cluster in the middle age range and mostly belong to the ">50K" class.
Grouped Bar Plot: A grouped bar plot can be used to compare the frequencies of a categorical variable across different groups. For example, we can compare the education levels among different income groups.
# Grouped bar plot for education and income groups
grouped_bar_plot = alt.Chart(sample_df).mark_bar().encode(
    x='education:N',
    y='count():Q',
    color='label:N',
    column='label:N'
).properties(
    title='Distribution of Education Level across Income Groups'
)
grouped_bar_plot
The grouped bars suggest that higher education levels are associated with a larger share of ">50K" incomes, although the plot shows association rather than causation.
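The same association can be quantified with a row-normalized crosstab, which gives the share of each income class within each education level. A sketch on a hypothetical mini-sample (with the real data, `df[['education', 'label']]` would be used):

```python
import pandas as pd

# Hypothetical mini-sample standing in for the full dataframe
df = pd.DataFrame({'education': ['Bachelors', 'Bachelors', 'HS-grad',
                                 'HS-grad', 'Masters'],
                   'label': ['>50K', '<=50K', '<=50K', '<=50K', '>50K']})

# normalize='index' gives the share of each income class per education level
rates = pd.crosstab(df['education'], df['label'], normalize='index')
```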
Machine learning#
Transform all data into numeric ones#
# Iterate over columns of object dtype in the DataFrame
for col in df.select_dtypes(include='object').columns:
    # Map each unique value in the column to a numeric code,
    # in order of first appearance
    value_mapping = dict(zip(df[col].unique(), range(df[col].nunique())))
    df[col] = df[col].map(value_mapping)
# the first unique value corresponds to '<=50K', which becomes the positive class 1
df['label'] = np.where(df['label'] == df['label'].unique()[0], 1, 0)
train = df.head(len(train))
test = df.tail(len(test))
X_train, X_test, y_train, y_test = train.drop('label', axis=1), test.drop('label', axis=1), train['label'], test['label']
The purpose of this code is to encode categorical variables with numeric labels. This transformation can be useful when working with machine learning algorithms that require numeric inputs. By mapping each unique value to a numeric label, the categorical variables are transformed into a format that can be easily processed by the algorithms.
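The mapping loop above is equivalent to `pd.factorize`, which also assigns integer codes in order of first appearance; a minimal sketch:

```python
import pandas as pd

s = pd.Series(['State-gov', 'Private', 'Private', 'State-gov'])

# factorize returns the integer codes and the unique values they index
codes, uniques = pd.factorize(s)
```

One caveat worth noting: this kind of integer encoding imposes an arbitrary order on nominal categories, which tree-based models tolerate but distance- and coefficient-based models (KNN, logistic regression) can misinterpret; one-hot encoding is a common alternative for those.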
Decision Trees#
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
clf = DecisionTreeClassifier(max_depth=4, max_leaf_nodes=20)
clf.fit(X_train, y_train)
# visualize the fitted tree
fig = plt.figure(figsize=(200, 100))
plot_tree(
    clf,
    feature_names=X_train.columns,
    filled=True
);
clf.score(X_train, y_train)
0.8449064832161174
clf.score(X_test, y_test)
0.8449017199017199
The train and test accuracies are nearly identical, so the model shows little overfitting or underfitting and interpreting its results is meaningful.
The tree starts with the entire dataset, which consists of 32,561 samples. At the root of the tree, the first split is made based on the “capital-gain” feature. If an individual’s capital gain is less than or equal to 5119.0, they follow the left branch; otherwise, they follow the right branch.
The left branch represents individuals with a capital gain less than or equal to 5119.0. Within this group, the tree further splits based on the “marital-status” feature. If an individual’s marital status is less than or equal to 0.5, they follow the left branch; otherwise, they follow the right branch.
The right branch represents individuals with a capital gain greater than 5119.0. Within this group, the tree splits based on the “capital-gain” feature again. If an individual’s capital gain is less than or equal to 7073.5, they follow the left branch; otherwise, they follow the right branch.
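Beyond reading the splits directly, the tree's `feature_importances_` attribute summarizes how much each feature drives the splits. A sketch on a tiny synthetic stand-in (with the project data, the fitted `clf` would be used directly):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny synthetic stand-in where capital-gain cleanly separates the classes
X = pd.DataFrame({'capital-gain': [0, 0, 6000, 8000, 0, 7000],
                  'age': [25, 48, 45, 50, 22, 30]})
y = [0, 0, 1, 1, 0, 1]

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# feature_importances_ sums to 1 and ranks each feature's contribution
importances = pd.Series(clf.feature_importances_, index=X.columns)
```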
KNN#
from sklearn.neighbors import KNeighborsClassifier
reg = KNeighborsClassifier(n_neighbors=2)
reg.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=2)
reg.score(X_train, y_train)
0.8613678941064463
reg.score(X_test, y_test)
0.8613636363636363
With n_neighbors=2, KNN is prone to overfitting, so no further work was done on this model.
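A common way to choose a less overfitting-prone value of k is cross-validation; a sketch on synthetic stand-in data (with the project data, `X_train`/`y_train` would be used instead):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data for illustration
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Score a few candidate k values with 5-fold cross-validation, keep the best
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X, y, cv=5).mean()
          for k in (1, 3, 5, 11, 21)}
best_k = max(scores, key=scores.get)
```

KNN is also distance-based, so in practice the features would be standardized first; otherwise large-scale columns such as fnlwgt dominate the distances.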
Logistics model#
from sklearn.linear_model import LogisticRegression
lg = LogisticRegression()
lg.fit(X_train, y_train)
LogisticRegression()
lg.score(X_train, y_train)
0.7957679432449863
lg.score(X_test, y_test)
0.7957923832923833
Again, the train and test accuracies are nearly identical, so the model shows little overfitting or underfitting and interpreting its coefficients is meaningful.
coefficients = lg.coef_[0]
intercept = lg.intercept_[0]
for feature_name, coef in zip(X_train.columns, coefficients):
    print(f"{feature_name}: {coef}")
print(f"Intercept: {intercept}")
age: 2.765496553248842e-05
workclass: 2.9604502063673026e-06
fnlwgt: 6.499810534462901e-06
education: 6.9485024111467315e-06
education-num: 6.3174650351243135e-06
marital-status: 1.7267699333154404e-06
occupation: 1.0679153784229773e-05
relationship: 4.937479617114498e-06
race: 8.244339835923701e-07
sex: 1.6757601566899827e-06
capital-gain: -0.0003211692305544448
capital-loss: -0.0007008467407639085
hours-per-week: 3.095780743855136e-05
native-country: 2.9178749614852293e-06
Intercept: 1.5086401875809778e-06
Here’s a simplified interpretation of the coefficients in plain words (recall that the positive class, label 1, corresponds to “<=50K”):
Age: The older a person is, the slightly more likely they are to belong to the positive class.
Workclass: The type of work a person does has a very small impact on whether they belong to the positive class or not.
Fnlwgt: The census sampling weight (fnlwgt) doesn’t strongly influence whether a person belongs to the positive class or not.
Education: The level of education a person has only has a minor effect on whether they belong to the positive class or not.
Education-num: An alternative representation of education level doesn’t have a strong impact on whether a person belongs to the positive class or not.
Marital-status: Whether a person is married or not doesn’t have a significant influence on belonging to the positive class.
Occupation: The type of occupation a person has exerts a small influence on whether they belong to the positive class or not.
Relationship: The nature of a person’s relationship doesn’t strongly determine whether they belong to the positive class or not.
Race: A person’s race has a very small effect on whether they belong to the positive class or not.
Sex: Gender has a minimal impact on whether a person belongs to the positive class or not.
Capital-gain: Higher capital gains are associated with a lower likelihood of belonging to the positive class.
Capital-loss: Higher capital losses are associated with a lower likelihood of belonging to the positive class.
Hours-per-week: Working more hours per week slightly increases the chances of belonging to the positive class.
Native-country: The country of origin has a minimal impact on whether a person belongs to the positive class or not.
Intercept: The base log-odds of belonging to the positive class when all features are zero is essentially zero.
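One caution with this reading: the coefficients above are on raw feature scales, so their magnitudes are not directly comparable across features with different units. Standardizing the features first makes the magnitudes comparable; a sketch on synthetic stand-in data (with the project data, `X_train`/`y_train` would be used):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data for illustration
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Standardizing puts every feature on a common scale, so the fitted
# coefficient magnitudes become directly comparable measures of influence
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)
coefs = model.named_steps['logisticregression'].coef_[0]
```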
Summary#
The machine learning models implemented in the project achieved moderate to good performance in predicting income levels. K-nearest neighbors scored highest on both the training and test data (about 0.861), followed by the decision tree (about 0.845) and logistic regression (about 0.796). The visualizations provided valuable insights into the dataset, highlighting the imbalance of income levels and the relationships between different variables.
References#
The dataset is the Adult (Census Income) dataset from the UCI Machine Learning Repository (the adult.data and adult.test files).
“The Elements of Statistical Learning” by Trevor Hastie, Robert Tibshirani, and Jerome Friedman: This book provides a comprehensive introduction to statistical learning methods, including those used for income prediction.
Kaggle Competitions: Kaggle is a popular platform for data science competitions. Participating in income prediction competitions on Kaggle can provide valuable insights into different modeling techniques and approaches.
Research Papers: There are numerous research papers published in the field of income prediction. Searching academic databases like Google Scholar or IEEE Xplore can help you find relevant papers based on your specific requirements.
Online Tutorials and Blogs: Many data science and machine learning websites offer tutorials and blog posts on income prediction. Websites like Towards Data Science, Medium, and Analytics Vidhya often have articles and tutorials on this topic.
Open-source Libraries and Documentation: Documentation and user guides of machine learning libraries such as scikit-learn and TensorFlow can provide detailed information on implementing various algorithms for income prediction.